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Abstract 

Compressed sensing is a technique to sample compressible signals below the Nyquist rate, whilst still allowing 
near optimal reconstruction of the signal. In this paper we present a theoretical analysis of the iterative hard 
thresholding algorithm when applied to the compressed sensing recovery problem. We show that the algorithm has 
the following properties (made more precise in the main text of the paper) 

• It gives near-optimal error guarantees. 

• It is robust to observation noise. 

• It succeeds with a minimum number of observations. 

• It can be used with any sampling operator for which the operator and its adjoint can be computed. 

• The memory requirement is linear in the problem size. 

• Its computational complexity per iteration is of the same order as the application of the measurement operator 
or its adjoint. 

• It requires a fixed number of iterations depending only on the logarithm of a form of signal to noise ratio of 
the signal. 

• Its performance guarantees are uniform in that they only depend on properties of the sampling operator and 
signal sparsity. 

I. Introduction 

For more than fifty years, the Nyquist-Shannon [1] [2] sampling theorem was generally used as the foundation 
of signal acquisition systems. Using this theory, it was common held belief that signals have to be sampled at 
twice the signal bandwidth. Whilst this is true for general band-limited signals, this theory does not acount for 
additional signal structures that might be known a priory. The recently emerging field of compressed sensing, 
[3], [4], [5], [6] and [7] and the related theory of signals with a finite rate of innovations [8], start from another 
premise. In compressed sensing, signals are assumed to be sparse in some transform domain. This sparsity constraint 
significantly reduces the size of the set of possible signals compared to the signal space dimension. This is easily 
visualised if one imagines the set of all images (say of dimension 256 x 256) with pixel values in the range from 
zero to one. To get an idea of how scant the set of true or interesting images is in this space, randomly draw images 
from this set, that is, draw the pixels from uniform distributions. One could spend the rest of once life doing this, 
without encountering anything remotely resembling an image. In other words, the set of images generally of interest 
(such as you recent holiday snapshots) occupy a minute subset of all possible images. 

This implies that sparse signals have an information content much smaller than that implied by the Nyquist rate. 
Instead of requiring 65536 numbers to represent the 256 x 256 image, if the image has a sparse representation, 
it is possible to represent the same image using many fewer bits of information. This property of many signals 
is exploited in compressed sensing. Instead of taking samples at the Nyquist rate, compressed sensing uses linear 
sampling operators that map the signal into a small (compared to the Nyquist rate) dimensional space. This process 
is a combination of sampling and compression, hence the name compressed sensing or compressive sampling. 
Contrary to the Nyquist-Shannon theory, in compressed sensing, reconstruction of the signal is non-linear. One of 
the important contributions of the seminal work by Candes, Romber, Tao [4], [5], [6] and Donoho [7], was to show 
that linear programming algorithms can be used under certain conditions on the sampling operator to reconstruct 
the original signal with high accuracy. 

Another set of algorithms, which could be shown to efficiently reconstruct signals from compressed sensing 
observations are greedy methods. A by now traditional approach is Orthogonal Matching Pursuit [9], which was 
analysed as a reconstruction algorithm for compressed sensing in [10]. Better theoretical properties were recently 
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proven for a regularised Orthogonal Matching Pursuit algorithm [11], [12]. Even more recently, the Subspace Pursuit 
[13] and the nearly identical Compressive Sampling Matching Pursuit (CoSaMP) [14] algorithms were introduced 
and analysed for compressed sensing signal reconstruction. Of all of these methods, CoSaMP currently offers the 
most comprehensive set of theoretic performance guarantees. It works for general sampling operators, is robust 
against noise and the performance is uniform in that it only depends on a property of the sampling operator and 
the sparsity of the signal, but not on the size of the non-zero signal coefficients. Furthermore, it requires minimal 
storage and computations and works with (up to a constant) a minimal number of observations. 

In a previous paper [15], iterative hard thresholding algorithms were studied. In particular, their convergence to 
fixed points of £q regularised (or constrained) cost functions could be proven. In this paper, it is shown that one of 
these algorithms (termed from now on IHTg) has similar performance guarantees to those of CoSaMP. 

A. Paper Overview 

Section |ll] starts out with a definition of sparse signal models and a statement of the compressed sensing problem. 
In section JIIJ we then discuss an iterative hard thresholding algorithm. The rest of this paper shows that this 
algorithm is able to recover, with high accuracy, signals form compressed sensing observations. This result is 
formally stated in the theorem and corollary in the first subsection of section |IVl The rest of section |IV] is devoted 
to the proof of this theorem and its corollary. The next section, section |Vl takes a closer look at a stopping criterion 
for the algorithm, which guarantees a certain estimation accuracy. The results of this paper are similar to those for 
the CoSaMP algorithm of [14] and a more detailed comparison is given in section [Vll 

B. Notation 

The following notation will be used in this paper x is an Af -dimensional real or complex vector y is an 
dimensional real or complex vector. <I> will denote an M x N real or complex matrix, whose transpose (hermitian 
transpose) will be denoted by <I>^. Many of the arguments in this paper will use sub-matrixes and sub-vectors. The 
letters F, B and A will denote sets of indices that enumerate the columns in <I> and the elements in the vectors y. 
Using these sets as subscripts, e.g., <I>r or yr, we mean matrixes (or vectors) formed by removing all but those 
columns (elements) from the matrix (vector) other than those in the set. We also have occasion to refer to quantities 
in a given iteration. Iterations are counted using n or k. For sets, we use the simplified notation F" whilst for 
vectors, the iteration count is given in square brackets y'"]. For the definition of the restricted isometry property 
used in the research literature, we use the Greek letter 5s to denote the restricted isometry constant. However, in 
this paper, we use a slightly modified definition of the property. The associated restricted isometry constant will be 
labelled by fig throughout. 

The following norms are used repeatedly. || • II2 is the Euclidean vector norm or, for matrixes, the operator norm 
from I2 to £2- We will also need the vector li norm || • The notation || • ||o will denote the number of non-zero 
elements in a vector. For a general vector y we use y'* to be any of the best s term approximations to y. The 
difference between the two will be y^ = y — y''. The support, that is the index set labeling the non-zero elements 
in y*, is defined as F* = suppjy'*} and similarly, F" = suppjyt"!}, where yl"] is the estimation of y in iteration 
n. Finally, the set i?" = F* |J F" is a superset of the support of the error r'"] = y'* — y'"^- Set difference is denoted 
using • \ •. 

II. Sparsity and Compressed Sensing 

A. Sparse Signal Models 

In this section we formalise the notion of sparsity and sparse signal models. A vector will be called s-spare if 
no more than s of its elements have non-zero values. We will talk of a best s-sparse approximation to a vector y 
to mean any one of the s-sparse vectors y* that minimise ||y* — y||2- 

Most signals in the real world are not exactly sparse. Instead, they are often well approximated by an s-sparse 
signal. A good example are images, whose wavelet transform has most of its energy concentrated in relatively 
few coefficients. To model such behaviour, we use the following notion. Assume the elements in a vector y are 
reordered such that > A signal is called p-compressible, if, for some constant R, the coefficients satisfy 

the following property 

\y^\ < Ri-^^", (1) 
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that is, the reordered coefficients decay with a power law. The importance of these p-compressible signals is that they 
can be well approximated using exact sparse signals. Let be any best s-term approximation of a p-compressible 
signal y, then it is easy to show that 

||y-y^||2<ci?sV2-Vf (2) 

and 

\\y-y'\\i<cRs^-^/p, (3) 

where c are different constants depending only on p, but not on s. Note that the above two terms bound the error 
in terms of the ii as well as the £2 norm. These two norms will appear in the main result derived in this paper 
and can therefore be used to derive performance guarantees for the recovery of ^^-compressible signals. 



B. Compressed Sensing 

We assume signals y to be representable as real or complex vectors of finite length. Compressed sensing samples 
such signals, using a hnear mapping $ into an M dimensional real or complex observation space. In matrix notation, 
the observed samples x are 

X = $y + e. (4) 

Here, $ is the linear sampling operator and e models possible observation noise due to, for example, sensor noise 
or quantisation errors in digital systems. 

In compressed sensing, two main problems have to be addressed. The first problem, not studied in detail in this 
paper, is to design measurement systems $ that possess certain desirable properties (such as the restricted isometry 
property of the next subsection), which allow for an efficient estimation of y. The second problem, which is at the 
focus of this paper, is the study of concrete algorithms for the efficient estimation of y, given only x and 



C. The Restricted Isometry Property 

The analysis of algorithms for compressed sensing relies heavily on the following property of the observation 
matrix A matrix $ satisfies the Restricted Isometry Condition (RIP) [4] if 

(i-'5.)||y||2< ll^y||i<(i + '5.)l|y||i (5) 

for all s-sparse y. The restricted isometry constant Sg is defined as the smallest constant for which this property 
holds for all s-sparse vectors y. Instead of using the above restricted isometry property, we will use a re-scaled 
matrix $ = which satisfies the following non-symettic isometty property, which is equivalent to the RIP 

defined above. 

(i-/3.)||y||i< ||$y||i< ||y||i (6) 

for all s-sparse y. Now (3s = 1 — We will say that for a matrix $ the RIP holds for sparsity s, if /3s < 1. 
Throughout this paper, when referring to the RIP, we mean in general this slightly modified version for which we 
always use the letter (3. If we need to refer to the standard RIP, for the non-normalised matrix $, we use the letter 
5. This is only required in order to compare our results to others in the literature. 

We also need the following properties of the RIP derived in for example [14]. Note that the RIP gives an upper 
bound on the largest and a lower bound on the s-largest singular values of all sub-matirxes of $ with s columns. 
Therefore, for all index sets F and all $ for which the RIP holds with s = |r|, the following inequahties hold 
trivially 

||$fx||2 < ||x||2, (7) 

(1 - /3|r|)||yr||2 < ll^r^ryrlb < ||yr||2 (8) 

The following two properties are at the heart of the proof of the main result of this paper and we derive these 
therefore here. 

Lemma 1: For all index sets F and all <1> for which the RIP holds with s = |F| 

||(I-$f$r)yr||2</3.||yr||2. (9) 



4 



Proof: The RIP guarantees that the eigenvalues of the matrix $p$r fall within the interval [1 — /3s, 1]. This 
can easily be seen to imply that the matrix I — $p$r has eigenvalues in the interval [0, /3s], which proves the 
lemma. ■ 
The next inequality is known and can be found in for example [14]. 

Lemma 2: For two disjoint sets T and A (i.e. F P| A = 0) and all $ for which the RIP holds with s = |F (J A| 

II^F^AYAlb < /3s||yA||2- (10) 

For completeness, we here repeat the proof similar to the one given in [14]. 

Proof: Let O = F |J A. Because the sets F and A are disjoint, the matrix — $p $a is a submatrix of I — $f^$n- 
By [16, Theorem 7.3.9], the largest singular value of a submatrix is bounded by the largest singular value of the 
full matrix. In other words, ||$p$a||2 < ||I — ^n^nlb- To bound the operator norm on the right, note that by the 
RIP, has eigenvalues in the interval [1 — /3s, 1]. Therefore I — has eigenvalues in the interval [0, 

which proves the lemma. ■ 



D. Designing Measuring Systems 

The RIP is a condition on the singular values of sub-matrixes of a matrix <1>. To use the results based on the RIP 
in practice, one would need to either, be able to calculate the RIP for a given matrix, or be able to design a matrix 
with a pre-specified RIP. Unfortunately, none of the above is possible in general. Instead, the only known approach 
to generate a matrix $ satisfying the RIP with high probability, is to use random constructions. For example, if 
the entries in $ are drawn independently from certain probability distributions (such as a Gaussian distribution or 
a Bernoulli distribution), with appropriate variance, then <I> will satisfy RIP j3s < € with probability (1 — e~'^^) if 
M > cs\og{N / s) / , where c and C are constants depending on the distribution of the elements in $. 

Another random construction, which is often more useful in applications, is to use a sub-matrix of an orthogonal 
matrix. For example, let $ be a (suitably normalised) sub-matrix of the matrix associated with the discrete Fourier 
transform, generated by drawing rows of the matrix at random (without replacement), then <^ will satisfy RIP 
Ps ^ ^ with probability (1 — e~'^^) if M > Cslog^ A^log(e~^)/e^, where c and C are constants. 

III. Iterative Hard Thresholding 

A. Definition of the Algorithm 

In [15], we introduced the following Iterative Hard Thresholding algorithm (IHTg). Let yl°J = and use the 
iteration 

y[n+l] ^ ^^^yN ^ ^T^^ _ ^y[n]))^ (11) 

where Hs (a) is the non-Unear operator that sets all but the largest (in magnitude) s elements of a to zero. If there 
is no unique such set, a set can be selected either randomly or based on a predefined ordering of the elements. The 
convergence of this algorithm was proven in [15] under the condition that the operator norm ||^||2 is smaller than 
one. In fact, the same argument can be used to show that the algorithm converges if our version of the restricted 
isometry property holds. The argument follows the same hne as that in [15] and we omit the details here. 



B. Computational Complexity per Iteration 

The iterative hard thresholding algorithm is very simple. It involves the application of the operators $ and <I>^ 
once in each iteration as well as two vector additions. The operator Hg involves a partial ordering of the elements 
of al^l = yN + $^(x — $yN) in magnitude. The storage requirements are small. Apart from storage of x, we only 
require the storage of the vector a, which is of length N. Storage of y["l, which has only s-non-zero elements, 
requires 2s numbers to be stored. 

The computational bottle neck, both in terms of storage and computation time, is due to the operators # and If 
these are general matrixes, the computational complexity and memory requirement is 0{MN). For large problems, 
it is common to use structured operators, basd for example on fast fourier transforms or Wavelet transforms, which 
require substantially less memory and can often be applied with 0{N log M) or even 0{N) operations. In this 
case, the above algorithm has minimal computational requirements per iteration. If C is the complexity of applying 
the operators $ and then the computational complexity of the algorithm is O{k*j0.), where k* is the total 
number of iterations. 
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IV. Iterative Hard Thresholding for Compressed Sensing 

In this section we derive the main result of this paper. We show that if Pas < V^' ^^^^ iterative hard 
thresholding algorithm reduces the estimation error in each iteration and is guaranteed to come within a constant 
factor of the best attainable estimation error. In fact, the algorithm needs a fixed number of iterations, depending 
only on the logarithm of a form of signal to noise ratio. 



A. Digest: The Main Result 

The main result of this paper can be formally stated in the following theorem and corollary. 

Theorem 1: Given a noisy observation x = $y + e, were y is an arbitrary vector. Let be an approximation 
to y with no-more than s non-zero elements for which ||y y''||2 is minimal. If $ has restricted isometry property 
with /33s < 1/8, then, at iteration k, IHTg will recover an approximation y*^ satisfying 



y l|2 < ^ ||y ||2 



+ 5e. 



where 



Furthermore, after at most 



es = l|y-y II2 



y 111 + l|e||2 



iterations, IHTg estimates y with accuracy 



l|y-y 



2 :i 



< 6 



|y ■ 



log2 



y'||2 + ^||y-y*||i + ||e||2 



(12) 
(13) 

(14) 
(15) 



For exact sparse signals, we have a slightly better corollary. 

Corollary 1: Given a noisy observation x = <5y* + e, were is s-sparse. If <I> has the restricted isometry 
property with fi^g < I/85 then, at iteration k, IHTg will recover an approximation y satisfying 



ly^-y^'lb < 2 ''||y^||2 + 4||e||2. 



Furthermore, after at most 



k* = 



log2 



kr* II 

|y II2 

l|e||2 



(16) 
(17) 



iterations, IHTg estimates y with accuracy 



\\y' -y''*h < 5||e||2. 



(18) 



B. Discussion of the Main Results 

The main theorem states that the algorithm will find an approximation that comes close to the true vector. 
However, there is a Umit to this. Asymptotically, we are only guaranteed to get as close as a multiple of 

= ||y - y^l|2 + ^||y - y*||i + ||e||2 (19) 

The quantity can be understood as an error term. This error term is composed of two components, the observation 
error e and the difference between the signal y and its best s term approximation y''. This makes intuitive sense. 
Assume, the observation error is zero and y is exactly s-sparse. In this case, the algorithm is guaranteed (under 
the conditions of the theorem) to find y exactly. For exact sparse signals, but with noisy observation, our success 
in recovering y is naturally limited by the size of the error. Assuming that y is not s-sparse, there will be an error 
between any s-term approximation and y. The closest we can get to y with any s-sparse approximation is therefore 
limited by how well y can be approximated with s-sparse signals. 

To show that this result is in fact optimal up to a constant, we can use the following argument. Theoretic 
considerations due to Kashin [17] and Gamaeva-Gluskin [18] show that any matrix $ G R^^^ must have at least 
M > cslog{N/s) for some constant c in order for the observation x = $y to allow a reconstruction y with 
||y — y||2 < C'/\/s||y||i- As discussed above, a random construction of $ can achieve the RIP required in the 
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theorem with high probabiUty if M > cs log{N/s). This shows that the dependence of the error guarantee of IHTg 
on -i=||y — y*||i is optimal up to a constant factor. Clearly, the dependence on ||y — y'^||2 is also unavoidable, 
as we cannot find any s-term approximation that beats the best s-term approximation, even if y would be known 
exactly. For a general bounded error e, the dependence on ||e|| is also unavoidable, even if the support of y'* would 
be known, the error e, projected onto this subspace, can induce an error depending on the size of e. In summary, 
the error must depend on ||y — y*||2 as there is no s-sparse approximation that could beat this. Furthermore, when 
projecting a signal into an s-dimensional observation space, the maximal information loss must lead to an error of 
the order -i=||y — y**||i. Finally, the worst case estimation error must also depend on the size of the observation 
error e. 

The overall number of iterations required to achieve a desired accuracy depends on the logarithm of li^Ui^. We 

can think of the quantity H^Ji^- as a signal to noise ratio appropriate for sparse signal estimates. This term is large, 
whenever the observation noise is small and the signal y is well approximated by an s-sparse vector. Bounds on 
this term can be derived for a particular signal model, such as, for example, the p-compressible signal model for 
which we have presented the required bounds above. 

C. Derivation of the Error Bound 

We now turn to the derivation of the main result. We start by proving the error bound in corollary [T] Let us 
recall some notation and introduce some abbreviations. We have 

1) X = $y^ + e, 

2) rt"-! = y"* — y^"', 

3) a["+i] = yM + ^>^(x - $y["l) = yN + ^^(^y^ + e - $yW), 

4) y["+^l = //^(at^+^l), where Hs is the hard thresholding operator that keeps the largest s (in magnitude) 
elements and sets the other elements to zero. 

5) r* = supp{y^}, 

6) r" = supp{yW}, 

7) = r*ijr"+\ 

Proof: [Proof of the error bound in corollary [B Consider the error 

||y -y^ ^||2 < llyfin+i - 1|2 + lly^^+i - ^B^+i Il2- UU) 

Note that y^ — y["+^] is supported on the set = r*Ur"+^. By the thresholding operation, y["+^l is the 

best s-term approximation to ai^t+}- In particular, it is a better approximation than y**. This implies that ||yt"^^l — 

[n+ll II ^ II s fn+ll II J , 

a^„+i ||2 < ||y — a^„+i ||2 and we have 



We now expand 
We then have 



||y^-y["+^l||2<2||yW-aSi\'||2. (21) 
atV.l = ygi.. + ^l^.^^v^^^ + <I>S".^e. (22) 



y 



\\2 S ^||yB"+i — y^^+i — S>g„+i<I>r^ J — 5>g„+ie||2 

< 2||rg+, -$^„+i$rW||2 + 2||«>|;„+ie||2 

= 2||(I - ci>B„+OrK+. - <i>i„+.f B.\B"+^rgl^^„,,||2 

+2||^>5„+ie||2 

< 2||(I-ci>|;„+,c|>B"+0rgL||2 

+2||cD|;„+,cI>B„\B„+0rS\B"+Jl2 
+2||$|J„+ie||2 

Now \ 5"+^ is disjoint from 5"+^ and |B"U5"+i| < 3s. We can therefore use the bounds in ©, ^ and 
(0) and the fact that /32s < 

||r["+i]||2 < 2/32s||r["l||2 + 2/33s||r["l||2 + 2||e||2 <4/33.||rW||2 + 2||e||2 (23) 
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If P3s<l, then 

||r["+i]||2 < 0.5||rM||2 + 2||e||2 (24) 
Iterating this relationship, and reahsing that 2(1 + 0.5 + 0.25 + •••)< 4 and that y''^' = 0, we get 

||rW||2 < 2-^'||y^||2 + 4||e||2 (25) 

This proves corollary [T] ■ 
To prove the main theorem, we use the following two lemmas from [14]. The first lemma is stated here without 

proof, which follows that presented in [14], with the required correction for our definition of the RIP. 

Lemma 3 (Needell and Tropp, Proposition 3.5 in [14]): Suppose the matrix $ satisfies the RIP ||<l>y^||2 < ||y^||2 

for all y** : lly'^llo < s, then for all vectors y, the following bound holds 



1 



(26) 



2 < y 2 + ^||y||i. 

Lemma 4 (Needell and Tropp, lemma 6.1 in [14]): For any y, let y'' be (the/any) best s-term approximation to 
y. Let y,. = y - y*. Let 

X = $y + e = ^y'' + $yr + e = ^>y* + e. (27) 



If the RIP holds for sparsity s, then the norm of the error e can be bounded by 



Proof: 



.... ^ .. 1 „ 

6 2 < y - y 2 + ^ y - y 1 + 62 



|e||2 = ||^>yr + e||2 = ||$(y-y'') +e||2 
< ||^(y-y')||2 + ||e||2 
< 



(28) 



1 



y-y^)||2 + ^||(y-y^)||i + ||e||2, 
\/s 



where the last inequality follows form Lamma [3] 

Proof: [Proof of the error bound in theorem [H To bound the error ||y — y^'^' ||2> we note that 



|y-y"^ll2 < 
< 



Ir'^'^lb + I|y-y1l2 

|rW||2 + ||(y-y^)||2 + ^||(y 



|i + l|e||2- 



The proof of the main theorem then follows by bounding Hrl*^! II2 using corollary [T] with e instead of e and lemma 
|4]to bound llelU, that is 



|rW||2 <2-^||y^||2 + 4 



y-y" 2 + ^ y-y" 



1 + eh 



(29) 



D. Derivation of the Iteration Count 

Proof: [Proof of the second part of theorem [H The first part of theorem [1] shows that 

||y-y['']||2<2-'=||y^||2 + 5e„ 



(30) 



where e. 



2 + 



75 1 



1 + eh 



. We are therefore guaranteed to reduce the error to below any 
multiple c of e^, as long as c > 5. For example, assume we want to recover ||y|| with an error of less than 6es. 
This implies that we require that 

(31) 



i.e. that 



2^ > 



y^lb < es 

"trS II 

y II2 



(32) 



which in turn implies the second part of the theorem. The proof of the corresponding result in corollary [T] follows 
the same argument. ■ 
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V. When to Stop 

So far, we have given guarantees on the achievable error and a bound on the total number of iterations to achieve 
this bound. However, in practice, it is necessary to monitor quantities of the algorithm and decide to stop the 
iterations at some point. From the main result, it is clear that in general, we cannot do any better than to find an 
estimate with an error of Se^. However, how do we know that we are getting close to this value? 

A possible stopping criterion is ||x — <i'y["]||2 < £■ For this criterion, it is possible to use the same arguments as 
in appendix A of [14], to derive the following result. 

Lemma 5: Assume that <I> satisfies the RIP with /?3s < 1/8. If at any iteration of IHTg the condition ||x — 
$y["l||2 < e holds, then 

||y-yN||2 < 1.07(e + 2e,), (33) 

where ^ 

= ||y - y^'lb + ^||y - y'||i + ||e||2. (34) 

Conversly, if at any iteration of IHTg the condition 

||y - y'"] lb < e - i/V^lly - l|i - ||e||2 (35) 

holds, then ||x - <I>y["]||2 < e. 

This lemma can be used to calculate a stopping criterion for IHTg. For example, if we want to estimate ||y||2 
with accuracy ce^, we know that we are done as soon as ||x — ^yj^'lb < (c/1.07 — 2)€s- Note that in general, 
IHTs is only guaranteed to work for c > 5, however, as soon as we observe that ||x — ^I^y'^^lb < (c/1.07 — 2), 
we know that the estimation error must be below els- 

Proof: To prove the first part, note that the stopping criterion implies that 

e> ||$(y-y["^) + e||2 = || ^(y^-y '"J) + e||2 

> yr:^||y«-yW||2-||e||2, 

so that 

l|y^-yt"i|b<^#. (36) 

Furthermore, 

||y-y["l||2< l|y'-y'"l||2 + ||y-y^||2, (37) 

so that (using the bound on ||e||2 from lemma IDl 

II fnlii ^ ^ + ll^lb II sii 

||y-y^"l||2 < ,^ '' +||y-y1l2 



< 



< 



e + 2||y - y1|2 + -^||y -y1li + ||e||2 
e + 2e. 



This proves the first part of the lemma using (32s < Pss < 1/8. 
To prove the second part, note that if 

||y - yt"l II2 < e - l/v^||y - y'^^ ||i - ||e||2. (38) 

holds, then 

e > ||y-yt"l||2 + i/y^||y-y["l||i + ||e||2 

> ll<f(y-y'"l)l|2 + ||e||2 

> ll$(y-y'"l)+e||2. 

We have here used lemma |3] to bound ||^>(y — yf"')||2 < ll(y ~ y^"'')||2 + "^ll(y ~ y^"')lli' where s is such that 
RIP holds for this s. 
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VI. Comparison to CoSaMP 

In [14] the authors introduced a subspace pursuit algorithm called CoSaMP, which offers similar guarantees to 
the iterative hard thresholding approach of this paper. The result for CoSaMP is as follows 

Theorem 2 (Needell & Tropp [14]): If $ has the (normal) restricted isometry property with 64s < 0.1, then, at 
iteration k, CoSaMP will recover an approximation y*^ satisfying 

||y'-y''||2<2-'=||y^||2 + 15||e||2 (39) 

if X = $y** + e for y* s-sparse and 

||y-y'=||2<2-'=||y||2 + 20e„ (40) 

if X = $y + e for all y. 

Two remarks are in order. Firstly, for IHTg, we require < 0.125, whilst for CoSaMP, 643 < 0.1 is required. 
Now, P and 5 are related as follows 

therefore, IHTg requires < 0.0667. To compare this to the requirement for CoSaMP, we use corollary 3.4 from 
[14], which states that for integers a and s, Sas < ci62s- Therefore, if 623 < 0.025, the condition for CoSaMP is 
satisfied, whilst for IHTg, we have a comparable condition requiring that 623 < 0.0222. 

Whilst the requirement on S23 is marginally weaker for CoSaMP as compared to IHT3, IHT3 is guaranteed to 
get a roughly four times lower approximation error. For example, in the exact sparse case, IHT3 is guaranteed 
to calculate an error approaching 4||e||2, which should be compared to the guarantee of 15||e||2 for CoSaMP. For 
general signals, the guarantees are big for IHT3 and 206^ for CoSaMP. 

The number of iterations required for IHT3 is logarithmical in the signal to noise ratio. This means that for 
noiseless observations, IHT3 would require an infinite number of iterations to reduce the error to zero. This is a 
well known property of algorithms that use updates of the form y["l + <l>^(x — <J>y["l). CoSaMP on the other hand 
is guaranteed to estimate y with precision 20es in at most 6(s + 1) iterations, however, to achieve this, CoSaMP 
requires the solution to an inverse problem in each iteration, which is costly. IHTg does not require the exact 
solution to an inverse problem. If CoSaMP is implemented using fast partial solutions to the inverse problems, the 
iteration count guarantees become similar to the ones derived here for IHT3. 



VII. What's in a Theorem 

A word of caution is in order. We have here shown that the Iterative Hard Thresholding algorithm has theoretical 
properties, which are comparable to those of other state of the art algorithms such as CoSaMP and has recovery 
guarantees of the same order as £1 based approaches. 

At a first glance, this seems to be at odds with previously reported numerical results [15], which have shown 
that IHT3 does not perform as well as other methods such as Orthogonal Matching Pursuit, for which there are 
currently no comparable performance guarantees. What is more, IHTg does also perform less well in numerical 
studies than CoSaMP or £1 based approaches. 

To understand this disparity, it has to be realised that the uniform performance guarantees derived here for IHTg 
and elsewhere for CoSaMP and ii based methods are worst case bounds, that is, they guarantee the performance 
of the algorithms in the worst possible scenario. Numerical studies on the other hand cannot in general test this 
worst case behaviour. This is because we do not in general know what particular signal would be the most difficult 
to recovered. Numerical experiments therefore analyse average behaviour of the methods, that is, they study the 
recovery of typical signals. 

Whilst uniform guarantees can be derived for the IHTg algorithm, the difference in numerically observed 
performance between this method and other algorithms with similar uniform recovery guarantees indicates that 
uniform guarantees are not necessarily a good measure to indicate good average performance. This is further 
confirmed when looking at the average case analysis of algorithms such as Orthogonal Matching Pursuit [9], for 
which the uniform guarantees are currently available are markedly different to those of IHTg, CoSaMP and ii 
methods [10], whilst in numerical studies, OMP can under certain conditions outperform some of these approaches. 
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VIII. Conclusion 

The abstract made eight claims regarding the performance of the iterative hard thresholding algorithm. Let us 
here summarise these in somewhat more detail. 

• Error guarantee; We have shown that an estimation error of 6||e||2 can be achieved within a finite number of 
iterations. 

• Robustness to noise; We have shown that the achievable estimation error depends linearly on the size of the 
observation error. Performance therefore degrades linearly if the noise is increased. 

• Minimum number of observations; The requirement on the isometry constant dictates that the number of 
observations grows linearly with s and logarithmically with N. Up to a constant, this relation is known to be 
the best attainable. 

• Sampling operator; The algorithm is simple and requires only the application of <I> and <I>^. 

• Memory requirement; We have shown this to be linear in the problem size, if we can ignore the storage 
requirement for <I> and <I>^. 

• Computational complexity; We could shown that the computational complexity is of the same order as the 
application of the measurement operator or its adjoint per iteration. The total number of iterations is bounded 
due to the linear convergence of the algorithm and depends logarithmically on the signal to noise ratio 

lly'lb/llelb- 



log2 



\\r\\: 



iterations, the estimation error is 



• Number of iterations. We have shown that after at most 
smaller than Gis- 

• Uniform performance guarantees. The results presented here only depend on P^s and do not depend on the 
size and distribution of the largest s elements in y. 

This is an impressive list of properties for such a relatively simple algorithm. To our knowledge, only the 
CoSaMP algorithm shares similar guarantees. However, as discussed in section IVlIl uniform guarantees are not the 
only consideration and in practice marked differences in the average performance of different methods are apparent. 
For many small problems, the restricted isometry property of random matrices is often too large to explain the 
behaviour of the different methods in these studies. Furthermore, it has long been observed, that the distribution of 
the magnitude of the non-zero coefficients also has an important influence on the performance of different methods. 
Whilst the theoretical guarantees derived in this and similar papers are important to understand the behaviour of an 
algorithm, it is also clear that other facts have to be taken into account in order to predict the typical performance 
of algorithms in many practical situations. 
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